{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img alt=\"arize llama-index logos\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/llama-index-knowledge-base-tutorial/arize_llamaindex.png\" width=\"400\">\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LlamaIndex Chunk Size, Retrieval Method and K Eval Suite\n",
    "\n",
    "This colab provides a suite of retrieval performance tests that helps teams understand\n",
    "how to setup the retrieval system. It makes use of the Phoenix Eval options for \n",
    "Q&A (overall did it answer the question) and retrieval (did the right chunks get returned).\n",
    "\n",
    "There is a sweep of parameters that is stored in experiment_data/results_no_zero_remove, \n",
    "check that directory for results. \n",
    "\n",
    "The goal is to help teams choose a Chunk size, retireval method, K for return chunks\n",
    "\n",
    "This colab downloads the script (py) files. Those files can be run without this colab directly,\n",
    "in a code only environment (VS code for example)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieval Eval\n",
    "\n",
    "This Eval evaluates whether a retrieved chunk contains an answer to the query. Its extremely useful for evaluating retrieval systems.\n",
    "\n",
    "https://docs.arize.com/phoenix/concepts/llm-evals/retrieval-rag-relevance\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Q&A EVal\n",
    "This Eval evaluates whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals that are checks on chunks of data returned, this check is a system level check of a correct Q&A.\n",
    "\n",
    "https://docs.arize.com/phoenix/concepts/llm-evals/q-and-a-on-retrieved-data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/chunking.png\" width=800/>\n",
    "    </p>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The challenge in setting up a retrieval system is having solid performance metrics that allow you to evaluate your different strategies:\n",
    "- Chunk Size\n",
    "- Retrieval Method\n",
    "- K value\n",
    "\n",
    "In setting the above variables you first need some overall Eval metrics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/eval_relevance.png\" width=800/>\n",
    "    </p>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above is the relevance evaluation used to check whether the chunk retrieved is relevant to the query."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/EvalQ_A.png\" width=600/>\n",
    "    </p>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above Eval shows a Q&A Eval on the entire system Q&A /\n",
    "on the overall question and answer. \n",
    "Each is used as we sweep through parameters to detremine effectiveness of retrieval."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sweeping values\n",
    "The scripts sweep through K, Retrival approach and chunk size, determining the trade off on your own docs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/sweep_k.png\" width=800/>\n",
    "    </p>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above shows sweeping through K=4 and K=6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/sweep_chunk.png\" width=800/>\n",
    "    </p>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above shows sweeping through Chunk Size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The script below runs a test on the question set, by default we have a 170 Question set\n",
    "# That takes some time to run so you can default it lower just to test\n",
    "# Comment this out to run on full dataset\n",
    "QUESTION_SAMPLES = 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install cohere matplotlib lxml openai 'arize-phoenix>=3.4' 'llama_index>0.10.3' 'openinference-instrumentation-llama-index>1.0.0 llama-index-callbacks-arize-phoenix' bs4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Retrieval Eval Scripts \n",
    "The following scripts can be run directly. In the case of long test suites, we recommend running \n",
    "the python script llama_index_w_evals_and_qa directly.py directly in python. All parameters are available \n",
    "in that script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Download scripts\n",
    "import requests\n",
    "\n",
    "url = \"https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/llama_index_w_evals_and_qa.py\"\n",
    "response = requests.get(url)\n",
    "with open(\"llama_index_w_evals_and_qa.py\", \"w\") as file:\n",
    "    file.write(response.text)\n",
    "\n",
    "url = \"https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/plotresults.py\"\n",
    "response = requests.get(url)\n",
    "with open(\"plotresults.py\", \"w\") as file:\n",
    "    file.write(response.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "import os\n",
    "import pickle\n",
    "\n",
    "import cohere\n",
    "import openai\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Phoenix Observabiility\n",
    "Click link below to visualize llamaIndex queries and chunking as its happening!!!!!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#########################################\n",
    "### CLICK LINK BELOW FOR PHOENIX VIZ ####\n",
    "#########################################\n",
    "# Phoenix can display in real time the traces automatically\n",
    "# collected from your LlamaIndex application.\n",
    "import phoenix as px\n",
    "import phoenix.experimental.evals.templates.default_templates as templates\n",
    "from llama_index import download_loader\n",
    "from llama_index_w_evals_and_qa import get_urls, plot_graphs, run_experiments\n",
    "from phoenix.experimental.evals import OpenAIModel\n",
    "\n",
    "# Look for a URL in the output to open the App in a browser.\n",
    "px.launch_app()\n",
    "# The App is initially empty, but as you proceed with the steps below,\n",
    "# traces will appear automatically as your LlamaIndex application runs.\n",
    "\n",
    "import llama_index\n",
    "\n",
    "llama_index.set_global_handler(\"arize_phoenix\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n",
    "cohere.api_key = os.getenv(\"COHERE_API_KEY\")\n",
    "\n",
    "# if loading from scratch, change these below\n",
    "web_title = \"arize\"  # nickname for this website, used for saving purposes\n",
    "base_url = \"https://docs.arize.com/arize\"\n",
    "# Local files\n",
    "file_name = \"raw_documents.pkl\"\n",
    "save_base = \"./experiment_data/\"\n",
    "if not os.path.exists(save_base):\n",
    "    os.makedirs(save_base)\n",
    "\n",
    "run_name = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n",
    "save_dir = os.path.join(save_base, run_name)\n",
    "if not os.path.exists(save_dir):\n",
    "    # Create a new directory because it does not exist\n",
    "    os.makedirs(save_dir)\n",
    "\n",
    "\n",
    "questions = pd.read_csv(\n",
    "    \"https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/constants.csv\",\n",
    "    header=None,\n",
    ")[0].to_list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This will determine run time, how many questions to pull from the data to run\n",
    "selected_questions = questions[:QUESTION_SAMPLES] if QUESTION_SAMPLES else questions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install --force-reinstall urllib3\n",
    "!pip uninstall urllib3 -y\n",
    "!pip install urllib3==2.0.4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_docs_filepath = os.path.join(save_base, file_name)\n",
    "if not os.path.exists(raw_docs_filepath):\n",
    "    print(f\"'{raw_docs_filepath}' does not exists.\")\n",
    "    urls = get_urls(base_url)  # you need to - pip install lxml\n",
    "    print(f\"LOADED {len(urls)} URLS\")\n",
    "\n",
    "print(\"GRABBING DOCUMENTS\")\n",
    "BeautifulSoupWebReader = download_loader(\"BeautifulSoupWebReader\")\n",
    "# two options here, either get the documents from scratch or load one from disk\n",
    "if not os.path.exists(raw_docs_filepath):\n",
    "    print(\"LOADING DOCUMENTS FROM URLS\")\n",
    "    # You need to 'pip install lxml'\n",
    "    loader = BeautifulSoupWebReader()\n",
    "    documents = loader.load_data(urls=urls)  # may take some time\n",
    "    with open(save_base + file_name, \"wb\") as file:\n",
    "        pickle.dump(documents, file)\n",
    "    print(\"Documents saved to raw_documents.pkl\")\n",
    "else:\n",
    "    print(\"LOADING DOCUMENTS FROM FILE\")\n",
    "    print(\"Opening raw_documents.pkl\")\n",
    "    with open(save_base + file_name, \"rb\") as file:\n",
    "        documents = pickle.load(file)\n",
    "##############################\n",
    "### PARAMETER SWEEPS BELOW ###\n",
    "##############################\n",
    "###chunk_sizes### to test, will sweep through values of chunk size\n",
    "chunk_sizes = [\n",
    "    100,\n",
    "    # 300,\n",
    "    # 500,\n",
    "    # 1000,\n",
    "    # 2000,\n",
    "]  # change this, perhaps experiment from 500 to 3000 in increments of 500\n",
    "\n",
    "### K ###: Sizes to test, will sweep through values of k\n",
    "k = [4, 6, 10]\n",
    "# k = [10]  # num documents to retrieve\n",
    "\n",
    "### Retrieval Approach ###: transformation to test will sweep through retrieval\n",
    "# transformations = [\"original\", \"original_rerank\",\"hyde\", \"hyde_rerank\"]\n",
    "transformations = [\"original\", \"original_rerank\"]\n",
    "# Model for Q&A\n",
    "llama_index_model = \"gpt-4\"\n",
    "# llama_index_model = \"gpt-3.5-turbo\"\n",
    "# Model for Evals\n",
    "eval_model = OpenAIModel(model_name=\"gpt-4\", temperature=0.0)\n",
    "\n",
    "qa_template = templates.QA_PROMPT_TEMPLATE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment when testing, 3 questions are easy to run through quickly\n",
    "# questions = questions[0:3]\n",
    "all_data = run_experiments(\n",
    "    documents=documents,\n",
    "    queries=questions,\n",
    "    chunk_sizes=chunk_sizes,\n",
    "    query_transformations=transformations,\n",
    "    k_values=k,\n",
    "    web_title=web_title,\n",
    "    save_dir=save_dir,\n",
    "    llama_index_model=llama_index_model,\n",
    "    eval_model=eval_model,\n",
    "    template=qa_template,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_data_filepath = os.path.join(save_dir, f\"{web_title}_all_data.pkl\")\n",
    "with open(all_data_filepath, \"wb\") as f:\n",
    "    pickle.dump(all_data, f)\n",
    "\n",
    "# The retrievals with 0 relevant context really can't be optimized, removing gives a diff view\n",
    "plot_graphs(\n",
    "    all_data=all_data,\n",
    "    save_dir=os.path.join(save_dir, \"results_zero_removed\"),\n",
    "    show=False,\n",
    "    remove_zero=True,\n",
    ")\n",
    "plot_graphs(\n",
    "    all_data=all_data,\n",
    "    save_dir=os.path.join(save_dir, \"results_zero_not_removed\"),\n",
    "    show=False,\n",
    "    remove_zero=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example Results Q&A Evals (actual results in experiment_data)\n",
    "\n",
    "The Q&A Eval runs at the highest level of did you get the question answer correct  based on the data:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/percentage_incorrect_plot.png\" />\n",
    "    </p>\n",
    "</center>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example Results Retrieval Eval  (actual results in experiment_data)\n",
    "\n",
    "The retrieval analysis example is below, iterates through the chunk sizes, K (4/6/10), retrieval method\n",
    "The eval checks whether the retrieved chunk is relevant and has a chance to answer the question"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/all_mean_precisions.png\" />\n",
    "    </p>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example Results Latency  (actual results in experiment_data)\n",
    "\n",
    "The latency can highly varied based on retrieval approaches, below are latency maps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/median_latency_all.png\" />\n",
    "    </p>\n",
    "</center>\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
